Morphological knowledge and alignment of English-German parallel corpora

Authors

  • Patrick Tschorn
  • Anke Lüdeling
Abstract

Alignment is an important step in exploiting parallel corpora linguistically. In this paper we introduce a morphological component that improves the alignment of German-English parallel texts and helps find correspondences between morphological elements on the sub-word level. This paper deals with a small aspect of an alignment system, namely the improvement of a dictionary-based distance measure through a morphological analyser.

What is alignment?

For the purposes of this paper we define a bilingual parallel text as a text (L1) and its translation (L2). A sentence-level alignment then maps groups of L1-sentences to corresponding groups of L2-sentences. These groups are often called "beads". An alignment can be viewed as a sequence of beads that covers the entire parallel text. While most beads express the correspondence between a single L1-sentence and a single L2-sentence, other types of beads arise when the translator splits, merges, deletes, adds or reorders sentences. Each sentence belongs to exactly one bead.

To illustrate some of the difficulties, consider the following excerpt from the very beginning of ‘The War of the Worlds’ parallel text:
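The bead definition given in the abstract can be made concrete with a minimal Python sketch. This is an illustration only, not the authors' implementation: the `Bead` class, the `covers_exactly_once` helper and the sentence indices are all hypothetical and are not taken from the paper or from the ‘The War of the Worlds’ text. Each bead holds a group of L1 sentence indices and a group of L2 sentence indices, and the helper checks the stated property that every sentence belongs to exactly one bead.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bead:
    """Hypothetical bead: a group of L1 sentence indices mapped to a group of L2 indices."""
    l1: Tuple[int, ...]
    l2: Tuple[int, ...]

    @property
    def kind(self) -> str:
        # Bead type written as "<number of L1 sentences>:<number of L2 sentences>".
        return f"{len(self.l1)}:{len(self.l2)}"


def covers_exactly_once(beads: List[Bead], n_l1: int, n_l2: int) -> bool:
    """Return True if every sentence index on each side occurs in exactly one bead."""
    seen_l1 = sorted(i for b in beads for i in b.l1)
    seen_l2 = sorted(j for b in beads for j in b.l2)
    # Sorting makes the check independent of bead order, so reordered
    # sentences are accepted while gaps and duplicates are rejected.
    return seen_l1 == list(range(n_l1)) and seen_l2 == list(range(n_l2))


# Invented toy alignment of 5 L1 sentences and 5 L2 sentences.
alignment = [
    Bead((0,), (0,)),    # 1:1, the usual case
    Bead((1,), (1, 2)),  # 1:2, one sentence split into two
    Bead((2, 3), (3,)),  # 2:1, two sentences merged
    Bead((4,), ()),      # 1:0, sentence deleted in translation
    Bead((), (4,)),      # 0:1, sentence added in translation
]

if __name__ == "__main__":
    print([b.kind for b in alignment])
    print(covers_exactly_once(alignment, 5, 5))
```

Representing empty groups as empty tuples keeps deletions and additions in the same structure as the other bead types, so the coverage check needs no special cases.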

Similar resources

Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria

We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written ...

Bilingual Sentence Alignment Based on Punctuation Marks

We present a new approach to aligning English and Chinese sentences in parallel corpora based solely on punctuation. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages such as French-English and German-English, it does not fare as well for parallel corpora that are noisy or written in two distant lan...

Minimally supervised lemmatization scheme induction through bilingual parallel corpora

We present a lemma induction scheme on a target language through minimally supervised alignment and transfer methods utilizing English-to-German parallel corpora. Compared to previous alignment and transfer approaches, the approach outlined here increases computational efficiency and significantly reduces the level of supervision necessary in inducing clusters of inflectional forms. Furthermore...

Projecting Temporal Annotations Across Languages

This thesis investigates the use of parallel corpora for the annotation of temporal objects and relations. In particular, we employ existing tools for the temporal analysis of English to annotate the English portion of an English-German bitext, and automatically project these annotations to the German text, guided by word alignment. Projection-based approaches to multilingual annotation have pr...

Automatic creation of WordNets from parallel corpora

In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling and UKB. After this step, the process of WordN...

Publication date: 2003